Refactor Ray Cluster/AppWrapper creation #650

Bobbins228 · 2024-08-27T11:42:38Z

Issue link

Closes: RHOAIENG-10385 and RHOAIENG-8846

What changes have been made

Removed base_template.yaml
Removed generate_yaml.py in favour of build_ray_cluster.py
Added V1RayCluster API
Removed unneeded labels
Removed head_info
Removed machine_types
Changed app_wrapper_yaml to resource_yaml and removed app_wrapper_name
Made ray_version a variable for potential future automation
Set the image config variable to default to quay.io/rhoai/ray:2.23.0-py39-cu121
Moved is_openshift_cluster function to cluster.py
Changed create_app_wrapper to create_resource
Removed enableIngress catch logic
Removed app_wrapper_name
Removed _components_resources_down in favour of _delete_resources
Refactored the get_cluster method to generate a new ClusterConfiguration with just the name and namespace of the cluster and retrieved yaml. Added is_appwrapper bool so that users can get AppWrappers/Ray Clusters
Added _retrieved_cluster boolean for get_cluster command to avoid generating a "false" Ray Cluster via ClusterConfiguration
Updated Cluster Configuration documentation and added new doc for methods used when interacting with Ray Clusters/AppWrappers.

Verification steps

Setup

Notebook server ODH/RHOAI/Local

Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
Checkout this PR's branch
Run poetry build - install if needed (pip install poetry)
Run pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl
Restart your notebook kernel

Testing

All `ClusterConfiguration` parameters must be tested with the new cluster creation method.

Keep a special eye out for the following as they were the most complex to implement:

worker_extended_resource_requests
head_extended_resource_requests
extended_resource_mapping
The notebook annotations.

Automated Notebook testing should cover the functionality changed but I still suggest all parameters should be human verified.

Test the new and improved `get_cluster()` function.

NOTE: You can compare the original & retrieved clusters by setting write_to_file=True on ClusterConfiguration and get_cluster()

Create a Ray Cluster
Restart your notebook kernel
Run cluster = get_cluster(cluster_name=<name>, namespace=<namespace>, is_appwrapper=False, write_to_file=True)
Run through the basic cluster. methods
Run cluster.down() then cluster.up()
Ensure the cluster is created successfully
Repeat steps for AppWrapper (Remember to enable AppWrappers in the CFO configmap!)

TODO

Add ImagePullSecrets # Done
Revise V1RayCluster

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- Testing is not required for this change

openshift-ci · 2024-08-27T11:42:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from bobbins228. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Fiona-Waters

Magnum opus for sure @Bobbins228 ! Will continue reviewing but some nitpicks and a question so far.

docs/ray_cluster_interaction.md

src/codeflare_sdk/cluster/cluster.py

KPostOffice

I'll give this another pass tomorrow. Good work!

KPostOffice · 2024-08-27T18:41:46Z

src/codeflare_sdk/models/v1_autoscaler_options.py

+
+    @imagePullPolicy.setter
+    def imagePullPolicy(self, imagePullPolicy):
+        self._imagePullPolicy = imagePullPolicy
+
+    @property
+    def securityContext(self):
+        return self._securityContext
+
+    @securityContext.setter
+    def securityContext(self, securityContext):
+        self._securityContext = securityContext
+
+    @property
+    def idleTimeoutSeconds(self):
+        return self._idleTimeoutSeconds
+
+    @idleTimeoutSeconds.setter
+    def idleTimeoutSeconds(self, idleTimeoutSeconds):
+        self._idleTimeoutSeconds = idleTimeoutSeconds
+
+    @property
+    def upscalingMode(self):
+        return self._upscalingMode
+
+    @upscalingMode.setter
+    def upscalingMode(self, upscalingMode):
+        self._upscalingMode = upscalingMode
+
+    @property
+    def env(self):
+        return self._env
+
+    @env.setter
+    def env(self, env):
+        self._env = env
+
+    @property
+    def envFrom(self):
+        return self._envFrom
+
+    @envFrom.setter
+    def envFrom(self, envFrom):
+        self._envFrom = envFrom
+
+    @property
+    def volumeMounts(self):
+        return self._volumeMounts
+
+    @volumeMounts.setter
+    def volumeMounts(self, volumeMounts):
+        self._volumeMounts = volumeMounts
+
+    def to_dict(self):
+        """Returns the model properties as a dict"""
+        result = {}
+
+        for attr, _ in six.iteritems(self.openapi_types):
+            value = getattr(self, attr)
+            if isinstance(value, list):
+                result[attr] = list(
+                    map(lambda x: x.to_dict() if hasattr(x, "to_dict") else x, value)
+                )
+            elif hasattr(value, "to_dict"):
+                result[attr] = value.to_dict()
+            elif isinstance(value, dict):
+                result[attr] = dict(
+                    map(
+                        lambda item: (item[0], item[1].to_dict())
+                        if hasattr(item[1], "to_dict")
+                        else item,
+                        value.items(),
+                    )
+                )
+            else:
+                result[attr] = value
+
+        return result
+
+    def to_str(self):
+        """Returns the string representation of the model"""
+        return pprint.pformat(self.to_dict())
+
+    def __repr__(self):
+        """For `print` and `pprint`"""
+        return self.to_str()
+
+    def __eq__(self, other):
+        """Returns true if both objects are equal"""
+        if not isinstance(other, V1AutoScalerOptions):
+            return False
+
+        return self.to_dict() == other.to_dict()
+
+    def __ne__(self, other):
+        """Returns true if both objects are not equal"""
+        if not isinstance(other, V1AutoScalerOptions):
+            return True
+
+        return self.to_dict() != other.to_dict()


Are these auto-generated or have you created them by hand?

Can you give more info about how they were created?

My main concern is how will we keep these up to date as the Ray API Spec evolves. I think this is definitely the correct approach I just want to raise this concern now.

I wanted to auto generate these but from what I can see in the KubeRay repo there is no valid Open API generator file -> they should look like this for ref: pet-store.yaml

So I based the models files on the way they should be auto generated from what I can see in the Python K8s API and the specs from raycluster_types.go.

I am not a fan of how I generated these files but given the alternative is to re-create the base template within build_ray_cluster.py I am not sure how we should proceed.

KPostOffice · 2024-08-27T18:46:11Z

src/codeflare_sdk/cluster/cluster.py

+        write_to_file=write_to_file,
+        appwrapper=is_appwrapper,
+    )
+    cluster = Cluster(cluster_config, is_retrieved_cluster=True)


Can we turn this into something like:

cluster = Cluster.cluster_from_k8

I'm not a huge fan of is_retrieved_cluster flag leaking out into public apis. It feels very awkward to me.

Can you elaborate a bit more on this?

That bool exists to prevent a Ray Cluster from being built in build_ray_cluster.py. Other wise when we create the limited ClusterConfiguration the Cluster object is created but you would get duplicated print statements for "Written to: {output_file_name} from first creating the limited cluster then from creating a file from the retrieved cluster.

You can create a class function which has different logic which also returns a cluster. Something like:

cluster = Cluster.new_cluster_skip_create(cluster_config)

You are a hero, that is a great idea!

@KPostOffice WDYT of the new changes for get_cluster I have added?

When trying to initialise the Cluster object I was getting exception errors for not providing a ClusterConfiguration. I tried to think of a way around this and I opted to set the CC as none when using get_cluster which would allow me to set the config and resource_yaml after initialisation.

I in turn added a warning if a user tried to specify ClusterConfiguration as None. I feel like this is still a bit crude and would welcome any further suggestions.

Do you want to spend some time pair programming this tomorrow? We can spitball some ideas

Yeah sounds good I can set up a meeting for later

src/codeflare_sdk/cluster/cluster.py

tests/test-case-custom-image.yaml

src/codeflare_sdk/utils/build_ray_cluster.py

Signed-off-by: Bobbins228 <[email protected]>

…ction doc Signed-off-by: Bobbins228 <[email protected]>

openshift-merge-robot · 2024-09-18T13:07:02Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Bobbins228 · 2024-10-31T16:32:07Z

Closing in favour of #751

Bobbins228 added the test-guided-notebooks Run PR check to verify Guided notebooks label Aug 27, 2024

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2024

openshift-ci bot requested review from ChristianZaccaria and KPostOffice August 27, 2024 11:42

sutaakar added test-guided-notebooks Run PR check to verify Guided notebooks and removed test-guided-notebooks Run PR check to verify Guided notebooks labels Aug 27, 2024

Bobbins228 force-pushed the refactor-ray-cluster branch from e5e8b9b to f25fddc Compare August 27, 2024 13:01

Bobbins228 changed the title ~~[WIP] Refactor Ray Cluster/AppWrapper creation~~ Refactor Ray Cluster/AppWrapper creation Aug 27, 2024

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2024

Bobbins228 added test-guided-notebooks Run PR check to verify Guided notebooks and removed test-guided-notebooks Run PR check to verify Guided notebooks labels Aug 27, 2024

Fiona-Waters reviewed Aug 27, 2024

View reviewed changes

Bobbins228 force-pushed the refactor-ray-cluster branch 2 times, most recently from cfc037f to dab3020 Compare August 27, 2024 15:52

KPostOffice reviewed Aug 27, 2024

View reviewed changes

Bobbins228 force-pushed the refactor-ray-cluster branch from dab3020 to ee5ddc7 Compare September 2, 2024 14:11

Bobbins228 added test-guided-notebooks Run PR check to verify Guided notebooks and removed test-guided-notebooks Run PR check to verify Guided notebooks labels Sep 2, 2024

Bobbins228 added 3 commits September 4, 2024 16:11

refactor: refactor creation of ray cluster/appwrapper

dc5b27f

Signed-off-by: Bobbins228 <[email protected]>

test: update unit tests after resource creation refactor

ef13441

Signed-off-by: Bobbins228 <[email protected]>

docs: update cluster configuration doc and add new ray cluster intera…

2ef248c

…ction doc Signed-off-by: Bobbins228 <[email protected]>

Bobbins228 force-pushed the refactor-ray-cluster branch from ee5ddc7 to 2ef248c Compare September 4, 2024 15:13

Bobbins228 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2024

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2024

removed comment

72caabc

Bobbins228 closed this Oct 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Ray Cluster/AppWrapper creation #650

Refactor Ray Cluster/AppWrapper creation #650

Bobbins228 commented Aug 27, 2024 •

edited

Loading

openshift-ci bot commented Aug 27, 2024

Fiona-Waters left a comment

KPostOffice left a comment

KPostOffice Aug 27, 2024 •

edited

Loading

Bobbins228 Sep 2, 2024

KPostOffice Aug 27, 2024

Bobbins228 Sep 2, 2024

KPostOffice Sep 3, 2024

Bobbins228 Sep 3, 2024

Bobbins228 Sep 4, 2024

KPostOffice Sep 4, 2024

Bobbins228 Sep 5, 2024

openshift-merge-robot commented Sep 18, 2024

Bobbins228 commented Oct 31, 2024

Refactor Ray Cluster/AppWrapper creation #650

Refactor Ray Cluster/AppWrapper creation #650

Conversation

Bobbins228 commented Aug 27, 2024 • edited Loading

Issue link

What changes have been made

Verification steps

Setup

Notebook server ODH/RHOAI/Local

Testing

All ClusterConfiguration parameters must be tested with the new cluster creation method.

Test the new and improved get_cluster() function.

TODO

Checks

openshift-ci bot commented Aug 27, 2024

Fiona-Waters left a comment

Choose a reason for hiding this comment

KPostOffice left a comment

Choose a reason for hiding this comment

KPostOffice Aug 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

openshift-merge-robot commented Sep 18, 2024

Bobbins228 commented Oct 31, 2024

Bobbins228 commented Aug 27, 2024 •

edited

Loading

All `ClusterConfiguration` parameters must be tested with the new cluster creation method.

Test the new and improved `get_cluster()` function.

KPostOffice Aug 27, 2024 •

edited

Loading